For my second-year qualifying exams, my three committee members each give me academic papers to read. Each committee member has a different expertise (Computational Methods in Social Neuroscience, Memory, and Social Interaction and Faces), which lets me learn about a wide breadth of my interests.
With approximately 20 papers per professor, I was overwhelmed by how to organize the content I was given. While some members were kind enough to organize the papers for me, others left it to me to make meaning out of what I was given. Interestingly, there was some overlap between topics and even overlap between suggested papers, which made me think to organize all of them together!
So, I turned to Python to do it for me. I was aided by Brandon Rose's online tutorial, Document Clustering with Python (http://brandonrose.org/clustering_mobile), which has a great in-depth explanation of everything he did.
To begin, I used a command-line tool from xpdf called pdftotext to convert my PDFs to text files. The command is just
pdftotext file
but I sped it up with a for loop and fixed the encoding:
for file in *.pdf; do pdftotext -enc UTF-8 "$file" ; done
I also did a quick renaming of the files with rename, to remember who recommended each one:
rename 's/^/Luke - /' *
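For anyone who would rather do the renaming step from Python, here is a minimal sketch using pathlib. The helper name `add_prefix` and the throwaway demo directory are assumptions made up for illustration; they are not part of the workflow above, and unlike the `rename` one-liner this version only touches .txt files.

```python
from pathlib import Path
import tempfile

def add_prefix(directory, prefix):
    """Rename every .txt file in `directory` so its name starts with `prefix`.

    Hypothetical helper: files that already carry the prefix are skipped,
    so running it twice does not double the prefix.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed
    for p in sorted(Path(directory).glob("*.txt")):
        if p.name.startswith(prefix):
            continue
        target = p.with_name(prefix + p.name)
        p.rename(target)
        renamed.append(target.name)
    return renamed

# Small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "faces.txt").write_text("example", encoding="utf-8")
    (Path(tmp) / "memory.txt").write_text("example", encoding="utf-8")
    demo_result = add_prefix(tmp, "Luke - ")
    print(demo_result)
```

Running it once per committee member's folder, with a different prefix each time, would give the same "Name - title" file names that the later code relies on.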
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter/IPython magic: render plots inline (only works inside a notebook)
%matplotlib inline
# Get all the text files in my pdf directory
path = './pdf'
text_files = [f for f in os.listdir(path) if f.endswith('.txt')]
text_files.sort()
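# Load the text of each paper into a list
paper_text = []
for file in text_files:
    # pdftotext was run with -enc UTF-8, so read the files back as UTF-8
    with open(os.path.join(path, file), "r", encoding="utf-8") as f:
        paper_text.append(f.read())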
# Get just the paper names (remove .txt)
paper_names = [x[:-4] for x in text_files]
# Get which committee member recommended the file (the first word of the file name)
committee = [x.split()[0] for x in text_files]
committee_order = ['Luke', 'Thalia', 'Jeremy']
# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')
# load nltk's SnowballStemmer as variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
# Define a tokenizer and stemmer that returns the list of stems in the text it is passed
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word, so that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    return [stemmer.stem(t) for t in filtered_tokens]

# Same tokenization and filtering, but lowercase the words instead of stemming them
def tokenize_only(text):
    # first tokenize by sentence, then by word, so that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    return [token for token in tokens if re.search('[a-zA-Z]', token)]
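To make the filtering step concrete, here is a small sketch of what the `re.search('[a-zA-Z]', token)` test keeps and drops. It deliberately leaves NLTK out so it runs anywhere; the function name `keep_tokens_with_letters` and the `sample_tokens` list are made up for illustration.

```python
import re

def keep_tokens_with_letters(tokens):
    """Keep only tokens that contain at least one ASCII letter."""
    return [t for t in tokens if re.search('[a-zA-Z]', t)]

# Hypothetical tokens, roughly what a word tokenizer might produce
sample_tokens = ['Faces', 'are', 'processed', 'in', '170', 'ms', ',', '(', 'N170', ')', '.']
kept = keep_tokens_with_letters(sample_tokens)
print(kept)
```

Pure numbers and punctuation are dropped, while mixed tokens such as 'N170' survive because they contain a letter.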
# Build one big flat vocabulary list (using extend), both stemmed and unstemmed
totalvocab_stemmed = []
totalvocab_tokenized = []
for text in paper_text:
    # for each paper, tokenize and stem, then extend the stemmed vocabulary
    totalvocab_stemmed.extend(tokenize_and_stem(text))
    # do the same without stemming, so stems can be mapped back to full words later
    totalvocab_tokenized.extend(tokenize_only(text))
from sklearn.feature_extraction.text import TfidfVectorizer
# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 4))
tfidf_matrix = tfidf_vectorizer.fit_transform(paper_text) #fit the vectorizer to paper_text
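The `min_df=0.2` and `max_df=0.8` settings are proportions of documents: terms that appear in too few or too many papers are dropped before clustering. Here is a rough pure-Python sketch of that idea, assuming the thresholds are applied as "keep a term if its document count lies between min_df * n_docs and max_df * n_docs". The function name and `toy_docs` are invented for illustration, and this is not scikit-learn's actual implementation.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=0.2, max_df=0.8):
    """Return the terms whose document frequency falls within the given proportions."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # set() so a term counts once per document, however often it occurs
        doc_freq.update(set(tokens))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(t for t, count in doc_freq.items() if low <= count <= high)

# Six hypothetical, already-tokenized "papers"
toy_docs = [
    ['the', 'face', 'memory'],
    ['the', 'face', 'memory'],
    ['the', 'face', 'social'],
    ['the', 'face', 'social', 'memory'],
    ['the', 'face', 'gaze'],
    ['the'],
]
kept_terms = filter_by_document_frequency(toy_docs)
print(kept_terms)
```

In this toy corpus 'the' and 'face' appear in nearly every document and 'gaze' in only one, so only the mid-frequency terms are left to distinguish the papers.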
# (taken and modified from http://brandonrose.org/clustering_mobile)
from sklearn.cluster import KMeans
num_clusters = 10
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
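# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
terms = tfidf_vectorizer.get_feature_names_out()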
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
# for each cluster, sort the term indices by their weight in the centroid (largest first)
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :10]:  # replace 10 with n words per cluster
        # map the first stem of the term back to a full word via vocab_frame
        print(' %s' % vocab_frame.loc[terms[ind].split(' '), :].values.tolist()[0][0], end=',')
    print()  # add whitespace
    print()  # add whitespace
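The `argsort()[:, ::-1]` line can look cryptic, so here is a minimal pure-Python sketch of what it does for a single cluster: rank the term indices by their weight in the centroid, largest first, and read off the top few terms. The helper name `top_terms`, the stems, and the weights are all made up for illustration.

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest weights in one cluster centroid."""
    # indices of the centroid's entries, ordered from largest to smallest weight
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in order[:n]]

# Hypothetical stems and centroid weights for one cluster
toy_terms = ['face', 'memori', 'social', 'gaze']
toy_centroid = [0.1, 0.7, 0.05, 0.4]
best = top_terms(toy_centroid, toy_terms)
print(best)
```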
These track well with my understanding of how these texts should be categorized. Some are about face processing, some are about social interaction and some deal with communication and memory.
By creating a clustermap, I can also see whether these texts cluster together. You'll notice that one pair of papers shows a similarity of 1, and that's because it's the exact same paper (a good sanity check). Down near the end, I also noticed that the three face identification papers all clump together!
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
sim = pd.DataFrame(cosine_similarity(tfidf_matrix))
sim.columns = text_files
sns.clustermap(sim.T, cmap='RdBu_r', figsize = (15,15))
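As a reminder of what the similarity values mean, here is a small hand-rolled sketch of cosine similarity on toy vectors. The vectors and names are invented for illustration; the real computation above is done by scikit-learn's `cosine_similarity` on the TF-IDF matrix.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors for three "papers"
paper_a = [1.0, 2.0, 0.0]
paper_a_duplicate = [1.0, 2.0, 0.0]  # same paper suggested twice
paper_b = [0.0, 0.0, 3.0]            # shares no terms with paper_a

sim_same = cosine_sim(paper_a, paper_a_duplicate)
sim_diff = cosine_sim(paper_a, paper_b)
print(sim_same, sim_diff)
print("distance between a and b:", 1 - sim_diff)
```

Identical documents score (up to floating-point rounding) 1, and documents with no terms in common score 0, which is why a duplicated paper stands out in the clustermap and why `1 - cosine_similarity` works as a distance.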
Finally, I can get to my goal of categorizing these texts. I chose to use a dendrogram to see all the associations and manually choose how many categories I would like (I ended up with 10).
from scipy.cluster.hierarchy import ward, dendrogram
sns.set_style("dark")
# Build the linkage matrix with Ward clustering.
# Note: SciPy treats a 2-D array as observation vectors, so each row of dist
# is clustered as a feature vector rather than as precomputed distances.
linkage_matrix = ward(dist)
fig, ax = plt.subplots(figsize=(8, 16))  # set size
dendrogram(linkage_matrix, orientation="left", labels=paper_names,
           color_threshold=1.5, above_threshold_color='gray', ax=ax)
plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom=False,       # hide ticks along the bottom edge
    top=False,          # hide ticks along the top edge
    labelbottom=False)  # hide the x-axis tick labels
# Recolor the paper labels (on the y-axis) by committee member
ax = plt.gca()
ylbls = ax.get_ymajorticklabels()
my_palette = ['maroon', 'navy', 'darkgreen']
for lbl in ylbls:
    # the first word of each label is the committee member's name
    val = committee_order.index(lbl.get_text().split()[0])
    lbl.set_color(my_palette[val])
# save figure
plt.savefig('specialist_grouping.png', dpi=200, bbox_inches='tight')  # save figure as specialist_grouping.png